{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# task3:Faster 情感分析\n", "\n", "上一章中我们已经介绍了基于RNN的升级版本的情感分析,在这一小节中,我们将学习一种不使用RNN的方法:我们将实现论文 [Bag of Tricks for Efficient Text Classification](https://arxiv.org/abs/1607.01759)中的模型,该论文已经放在了教程中,感兴趣的小伙伴可以参考一下。这个简单的模型实现了与第二章情感分析相当的性能,但训练速度要快得多。" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3.1 数据预处理\n", "\n", "FastText分类模型与其他文本分类模型最大的不同之处在于其计算了输入句子的n-gram,并将n-gram作为一种附加特征来获取局部词序特征信息添加至标记化列表的末尾。n-gram的基本思想是,将文本里面的内容按照字节进行大小为n的滑动窗口操作,形成了长度是n的字节片段序列,其中每一个字节片段称为gram。具体而言,在这里我们使用bi-grams。\n", "\n", "例如,在句子“how are you ?”中,bi-grams 是:“how are”、“are you”和“\"you ?”。\n", "\n", "“generate_bigrams”函数获取一个已经标注的句子,计算bigrams并将其附加到标记化列表的末尾。" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "def generate_bigrams(x):\n", " n_grams = set(zip(*[x[i:] for i in range(2)]))\n", " for n_gram in n_grams:\n", " x.append(' '.join(n_gram))\n", " return x" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "例子:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['This', 'film', 'is', 'terrible', 'film is', 'This film', 'is terrible']" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "generate_bigrams(['This', 'film', 'is', 'terrible'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "TorchText 'Field' 中有一个`preprocessing`参数。此处传递的函数将在对句子进行 tokenized (从字符串转换为标token列表)之后,但在对其进行数字化(从tokens列表转换为indexes列表)之前应用于句子。我们将在这里传递`generate_bigrams`函数。\n", "\n", "由于我们没有使用RNN,所以不需要使用压缩填充序列,因此我们不需要设置“include_length=True”。" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/ben/miniconda3/envs/pytorch17/lib/python3.8/site-packages/torchtext-0.9.0a0+c38fd42-py3.8-linux-x86_64.egg/torchtext/data/field.py:150: UserWarning: Field class will be retired soon and moved to torchtext.legacy. Please see the most recent release notes for further information.\n", " warnings.warn('{} class will be retired soon and moved to torchtext.legacy. Please see the most recent release notes for further information.'.format(self.__class__.__name__), UserWarning)\n", "/home/ben/miniconda3/envs/pytorch17/lib/python3.8/site-packages/torchtext-0.9.0a0+c38fd42-py3.8-linux-x86_64.egg/torchtext/data/field.py:150: UserWarning: LabelField class will be retired soon and moved to torchtext.legacy. Please see the most recent release notes for further information.\n", " warnings.warn('{} class will be retired soon and moved to torchtext.legacy. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "TorchText `Field`s have a `preprocessing` argument. A function passed here is applied to each sentence after it has been tokenized (turned from a string into a list of tokens), but before it has been numericalized (turned from a list of tokens into a list of indexes). This is where we pass our `generate_bigrams` function.\n", "\n", "As we are not using an RNN we do not need packed padded sequences, so we do not need to set `include_lengths = True`." ] },
{ "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/ben/miniconda3/envs/pytorch17/lib/python3.8/site-packages/torchtext-0.9.0a0+c38fd42-py3.8-linux-x86_64.egg/torchtext/data/field.py:150: UserWarning: Field class will be retired soon and moved to torchtext.legacy. Please see the most recent release notes for further information.\n", "  warnings.warn('{} class will be retired soon and moved to torchtext.legacy. Please see the most recent release notes for further information.'.format(self.__class__.__name__), UserWarning)\n", "/home/ben/miniconda3/envs/pytorch17/lib/python3.8/site-packages/torchtext-0.9.0a0+c38fd42-py3.8-linux-x86_64.egg/torchtext/data/field.py:150: UserWarning: LabelField class will be retired soon and moved to torchtext.legacy. Please see the most recent release notes for further information.\n", "  warnings.warn('{} class will be retired soon and moved to torchtext.legacy. Please see the most recent release notes for further information.'.format(self.__class__.__name__), UserWarning)\n" ] } ], "source": [ "import torch\n", "from torchtext.legacy import data\n", "from torchtext.legacy import datasets\n", "\n", "SEED = 1234\n", "\n", "torch.manual_seed(SEED)\n", "torch.backends.cudnn.deterministic = True\n", "\n", "TEXT = data.Field(tokenize = 'spacy',\n", "                  tokenizer_language = 'en_core_web_sm',\n", "                  preprocessing = generate_bigrams)\n", "\n", "LABEL = data.LabelField(dtype = torch.float)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "\n", "As before, load the IMDb dataset and create the splits:" ] },
{ "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/ben/miniconda3/envs/pytorch17/lib/python3.8/site-packages/torchtext-0.9.0a0+c38fd42-py3.8-linux-x86_64.egg/torchtext/data/example.py:78: UserWarning: Example class will be retired soon and moved to torchtext.legacy. Please see the most recent release notes for further information.\n", "  warnings.warn('Example class will be retired soon and moved to torchtext.legacy. Please see the most recent release notes for further information.', UserWarning)\n" ] } ], "source": [ "import random\n", "\n", "train_data, test_data = datasets.IMDB.splits(TEXT, LABEL)\n", "\n", "train_data, valid_data = train_data.split(random_state = random.seed(SEED))" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Build the vocab and load the pre-trained word embeddings:" ] },
{ "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "MAX_VOCAB_SIZE = 25_000\n", "\n", "TEXT.build_vocab(train_data, \n", "                 max_size = MAX_VOCAB_SIZE, \n", "                 vectors = \"glove.6B.100d\", \n", "                 unk_init = torch.Tensor.normal_)\n", "\n", "LABEL.build_vocab(train_data)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Create the iterators:" ] },
{ "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/ben/miniconda3/envs/pytorch17/lib/python3.8/site-packages/torchtext-0.9.0a0+c38fd42-py3.8-linux-x86_64.egg/torchtext/data/iterator.py:48: UserWarning: BucketIterator class will be retired soon and moved to torchtext.legacy. Please see the most recent release notes for further information.\n", "  warnings.warn('{} class will be retired soon and moved to torchtext.legacy. Please see the most recent release notes for further information.'.format(self.__class__.__name__), UserWarning)\n" ] } ], "source": [ "BATCH_SIZE = 64\n", "\n", "device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')\n", "\n", "train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(\n", "    (train_data, valid_data, test_data), \n", "    batch_size = BATCH_SIZE, \n", "    device = device)" ] },
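{ "cell_type": "markdown", "metadata": {}, "source": [ "As an optional sanity check (a minimal sketch, assuming the legacy batches expose a `.text` attribute exactly as the training loop below uses), we can peek at one batch to confirm the tensor layout the iterators produce, since the model in the next section expects text batches of shape [sentence length, batch size]." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# optional check: the iterators yield batches whose .text tensor is\n", "# shaped [sent len, batch size], which the model's forward() relies on\n", "batch = next(iter(train_iterator))\n", "print(batch.text.shape)" ] },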
{ "cell_type": "markdown", "metadata": {}, "source": [ "## 3.2 Building the Model\n", "\n", "FastText is a classic deep-learning model built on word embeddings: an Embedding layer maps each word into a dense space, all of the word embeddings in a sentence are averaged, and the average is used for classification. As a result this model has far fewer parameters than the model in the previous chapter.\n", "\n", "Concretely, it first computes the embedding of each word with the `Embedding` layer (blue), then takes the average of all word embeddings (pink), and feeds that through the `Linear` layer (silver).\n", "\n", "![](assets/sentiment8.png)\n", "\n", "We implement the averaging of the words in embedding space with the 2-d pooling function `avg_pool2d`. We can think of the word embeddings as a 2-dimensional grid, with words along one axis and the embedding dimensions along the other. The image below shows an example sentence converted into 5-dimensional word embeddings, with words along the vertical axis and the embedding dimensions along the horizontal axis. Each element of this [4x5] tensor is represented by a green block.\n", "\n", "![](assets/sentiment9.png)\n", "\n", "`avg_pool2d` uses a filter of size `embedded.shape[1]` (i.e. the sentence length) by 1, shown in pink in the image below.\n", "\n", "![](assets/sentiment10.png)\n", "\n", "We compute the average of all the elements covered by the filter, then the filter slides to the right, computing the average of the next column of embedding values for each word in the sentence.\n", "\n", "![](assets/sentiment11.png)\n", "\n", "Each filter position gives one value, the average of all covered elements. After the filter has covered every embedding dimension we are left with a [1x5] tensor, which is then passed through the linear layer to make the prediction." ] },
{ "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "import torch.nn as nn\n", "import torch.nn.functional as F\n", "\n", "class FastText(nn.Module):\n", "    def __init__(self, vocab_size, embedding_dim, output_dim, pad_idx):\n", "        \n", "        super().__init__()\n", "        \n", "        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=pad_idx)\n", "        \n", "        self.fc = nn.Linear(embedding_dim, output_dim)\n", "        \n", "    def forward(self, text):\n", "        \n", "        #text = [sent len, batch size]\n", "        \n", "        embedded = self.embedding(text)\n", "        \n", "        #embedded = [sent len, batch size, emb dim]\n", "        \n", "        embedded = embedded.permute(1, 0, 2)\n", "        \n", "        #embedded = [batch size, sent len, emb dim]\n", "        \n", "        pooled = F.avg_pool2d(embedded, (embedded.shape[1], 1)).squeeze(1) \n", "        \n", "        #pooled = [batch size, embedding_dim]\n", "        \n", "        return self.fc(pooled)" ] },
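{ "cell_type": "markdown", "metadata": {}, "source": [ "To make the pooling step more concrete, the short check below (a hedged sketch using an arbitrary toy tensor, not part of the model itself) confirms that `avg_pool2d` with a filter of size [sentence length, 1] is the same as simply taking the mean over the sentence-length dimension." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import torch\n", "import torch.nn.functional as F\n", "\n", "# toy tensor standing in for `embedded` after the permute: [batch size, sent len, emb dim]\n", "toy = torch.randn(64, 10, 100)\n", "\n", "# the same pooling call used inside FastText.forward\n", "pooled = F.avg_pool2d(toy, (toy.shape[1], 1)).squeeze(1)\n", "\n", "print(pooled.shape)                             # torch.Size([64, 100])\n", "print(torch.allclose(pooled, toy.mean(dim=1)))  # expected: True, i.e. just a mean over dim 1" ] },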
{ "cell_type": "markdown", "metadata": {}, "source": [ "As before, create an instance of the `FastText` class:" ] },
{ "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "INPUT_DIM = len(TEXT.vocab)\n", "EMBEDDING_DIM = 100\n", "OUTPUT_DIM = 1\n", "PAD_IDX = TEXT.vocab.stoi[TEXT.pad_token]\n", "\n", "model = FastText(INPUT_DIM, EMBEDDING_DIM, OUTPUT_DIM, PAD_IDX)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Looking at the number of parameters in the model, we see it is roughly the same as the simple RNN from the first section and only about half that of the previous chapter's model. Almost all of these parameters sit in the embedding table (25,002 vocab entries x 100 dimensions = 2,500,200 weights); the linear layer adds just 101 more." ] },
{ "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The model has 2,500,301 trainable parameters\n" ] } ], "source": [ "def count_parameters(model):\n", "    return sum(p.numel() for p in model.parameters() if p.requires_grad)\n", "\n", "print(f'The model has {count_parameters(model):,} trainable parameters')" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Copy the pre-trained vectors into the embedding layer:" ] },
{ "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "tensor([[-0.1117, -0.4966,  0.1631,  ...,  1.2647, -0.2753, -0.1325],\n", "        [-0.8555, -0.7208,  1.3755,  ...,  0.0825, -1.1314,  0.3997],\n", "        [-0.0382, -0.2449,  0.7281,  ..., -0.1459,  0.8278,  0.2706],\n", "        ...,\n", "        [-0.1606, -0.7357,  0.5809,  ...,  0.8704, -1.5637, -1.5724],\n", "        [-1.3126, -1.6717,  0.4203,  ...,  0.2348, -0.9110,  1.0914],\n", "        [-1.5268,  1.5639, -1.0541,  ...,  1.0045, -0.6813, -0.8846]])" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pretrained_embeddings = TEXT.vocab.vectors\n", "\n", "model.embedding.weight.data.copy_(pretrained_embeddings)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Zero the initial weights of the unknown and padding tokens:" ] },
{ "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "UNK_IDX = TEXT.vocab.stoi[TEXT.unk_token]\n", "\n", "model.embedding.weight.data[UNK_IDX] = torch.zeros(EMBEDDING_DIM)\n", "model.embedding.weight.data[PAD_IDX] = torch.zeros(EMBEDDING_DIM)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## 3.3 Training the Model" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Training the model is exactly the same as in the previous section.\n", "\n", "Initialize the optimizer:" ] },
{ "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "import torch.optim as optim\n", "\n", "optimizer = optim.Adam(model.parameters())" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Define the criterion and place the model and criterion on the GPU:" ] },
{ "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "criterion = nn.BCEWithLogitsLoss()\n", "\n", "model = model.to(device)\n", "criterion = criterion.to(device)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Define the function to compute binary accuracy:" ] },
{ "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "def binary_accuracy(preds, y):\n", "    \"\"\"\n", "    Returns accuracy per batch, i.e. if you get 8/10 right, this returns 0.8, NOT 8\n", "    \"\"\"\n", "\n", "    #round predictions to the closest integer\n", "    rounded_preds = torch.round(torch.sigmoid(preds))\n", "    correct = (rounded_preds == y).float() #convert into float for division \n", "    acc = correct.sum() / len(correct)\n", "    return acc" ] },
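{ "cell_type": "markdown", "metadata": {}, "source": [ "As a quick illustration (a toy example with made-up numbers, not part of the training pipeline), rounding the sigmoid of the raw outputs and comparing against the labels gives the per-batch accuracy:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# toy logits and labels: sigmoid + round turns the logits into 1, 0, 1, 0,\n", "# so only the first two match the labels and the accuracy is 0.5\n", "toy_logits = torch.tensor([ 2.0, -3.0,  1.5, -1.0])\n", "toy_labels = torch.tensor([ 1.0,  0.0,  0.0,  1.0])\n", "\n", "print(binary_accuracy(toy_logits, toy_labels))" ] },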
[ "import time\n", "\n", "def epoch_time(start_time, end_time):\n", " elapsed_time = end_time - start_time\n", " elapsed_mins = int(elapsed_time / 60)\n", " elapsed_secs = int(elapsed_time - (elapsed_mins * 60))\n", " return elapsed_mins, elapsed_secs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "最后,训练我们的模型:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/ben/miniconda3/envs/pytorch17/lib/python3.8/site-packages/torchtext-0.9.0a0+c38fd42-py3.8-linux-x86_64.egg/torchtext/data/batch.py:23: UserWarning: Batch class will be retired soon and moved to torchtext.legacy. Please see the most recent release notes for further information.\n", " warnings.warn('{} class will be retired soon and moved to torchtext.legacy. Please see the most recent release notes for further information.'.format(self.__class__.__name__), UserWarning)\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Epoch: 01 | Epoch Time: 0m 7s\n", "\tTrain Loss: 0.688 | Train Acc: 61.31%\n", "\t Val. Loss: 0.637 | Val. Acc: 72.46%\n", "Epoch: 02 | Epoch Time: 0m 6s\n", "\tTrain Loss: 0.651 | Train Acc: 75.04%\n", "\t Val. Loss: 0.507 | Val. Acc: 76.92%\n", "Epoch: 03 | Epoch Time: 0m 6s\n", "\tTrain Loss: 0.578 | Train Acc: 79.91%\n", "\t Val. Loss: 0.424 | Val. Acc: 80.97%\n", "Epoch: 04 | Epoch Time: 0m 6s\n", "\tTrain Loss: 0.501 | Train Acc: 83.97%\n", "\t Val. Loss: 0.377 | Val. Acc: 84.34%\n", "Epoch: 05 | Epoch Time: 0m 6s\n", "\tTrain Loss: 0.435 | Train Acc: 86.96%\n", "\t Val. Loss: 0.363 | Val. Acc: 86.18%\n" ] } ], "source": [ "N_EPOCHS = 5\n", "\n", "best_valid_loss = float('inf')\n", "\n", "for epoch in range(N_EPOCHS):\n", "\n", " start_time = time.time()\n", " \n", " train_loss, train_acc = train(model, train_iterator, optimizer, criterion)\n", " valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)\n", " \n", " end_time = time.time()\n", "\n", " epoch_mins, epoch_secs = epoch_time(start_time, end_time)\n", " \n", " if valid_loss < best_valid_loss:\n", " best_valid_loss = valid_loss\n", " torch.save(model.state_dict(), 'tut3-model.pt')\n", " \n", " print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')\n", " print(f'\\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')\n", " print(f'\\t Val. Loss: {valid_loss:.3f} | Val. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "Get the test accuracy (achieved with far less training time than the model in the previous section):" ] },
{ "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Test Loss: 0.381 | Test Acc: 85.42%\n" ] } ], "source": [ "model.load_state_dict(torch.load('tut3-model.pt'))\n", "\n", "test_loss, test_acc = evaluate(model, test_iterator, criterion)\n", "\n", "print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## 3.4 Model Validation" ] },
{ "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "import spacy\n", "nlp = spacy.load('en_core_web_sm')\n", "\n", "def predict_sentiment(model, sentence):\n", "    model.eval()\n", "    tokenized = generate_bigrams([tok.text for tok in nlp.tokenizer(sentence)])\n", "    indexed = [TEXT.vocab.stoi[t] for t in tokenized]\n", "    tensor = torch.LongTensor(indexed).to(device)\n", "    tensor = tensor.unsqueeze(1)\n", "    prediction = torch.sigmoid(model(tensor))\n", "    return prediction.item()" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "An example negative review:" ] },
{ "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2.1313092350011553e-12" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "predict_sentiment(model, \"This film is terrible\")" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "An example positive review:" ] },
{ "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1.0" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "predict_sentiment(model, \"This film is great\")" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Summary\n", "\n", "In the next section we will use a convolutional neural network (CNN) for sentiment analysis.\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.10" } }, "nbformat": 4, "nbformat_minor": 4 }